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Abstract 

Understanding human mobility is of vital importance for urban planning, epidemiology, and many 
other fields that draw policies from the activities of humans in space. Despite the recent availability 
of large-scale data sets of GPS traces or mobile phone records capturing human mobility, typically 
only a subsample of the population of interest is represented, giving a possibly incomplete picture of 
the entire system under study. Methods to reliably extract mobility information from such reduced 
data and to assess their sampling biases are lacking. To that end, we analyzed a data set of millions 
of taxi movements in New York City. We first show that, once they are appropriately transformed, 
mobility patterns are highly stable over long time scales. Based on this observation, we develop a 
supersampling methodology to reliably extrapolate mobility records from a reduced sample based on 
an entropy maximization procedure, and we propose a number of network-based metrics to assess 
the accuracy of the predicted vehicle flows. Our approach provides a well founded way to exploit 
temporal patterns to save effort in recording mobility data, and opens the possibility to scale up 
data from limited records when information on the full system is required. 


Introduction 

The increased pervasiveness of information and communication technologies is enabling the track¬ 
ing of human mobility at an unprecedented scale. Massive call detail records from mobile phone 
activities [1|[2] and the use of global positioning systems (GPS) in large vehicle fleets [3] for instance, 
are generating extraordinary quantities of positional and movement data available for researchers 
who aim to understand human activity in space. Other data sources, such as observations of ban¬ 
knote circulation nisi. online location-based social networks ElE], radio frequency identification 
traces mis, or even virtual movements of avatars in online games m have also been used as 
proxies for human movements. These studies have provided valuable insights into several aspects of 
human mobility, uncovering distinct features of human travel behavior such as scaling laws BUS] 
or predictability of trajectories [13] among others. Besides empirical studies, the surge of available 
data on human mobility has also evoked interest in developing new theoretical models of mobility at 
several scales. Such models have deep implications for various subjects ranging from epidemiology 
to urbanism DBMZI, with special importance in city planning and policy action 1181 . 

Despite these first success stories, the theoretical development of tools and techniques for han¬ 
dling massive data sets of human mobility and for assessing their possible biases is still a road full 
of obstacles. Existing models based on gravity [19] , radiation [20], intervening opportunities [2T1 . 
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etc. present a first step towards an accurate proxy for mobility at medium and large range scales, 
but they have been proven to be not always satisfactory to describe short scale movement such 
as intra-city displacements. The size of the data analyzed, the multiple scales involved, the highly 
skewed statistical nature of human activities [55] and the lack of strict control on the reliability of 
the data are just some of the multiple challenges this exciting new era poses. 

Although they are often extensive, one of the main limitations of data sets used in empirically 
driven urban-scale mobility research is the limited coverage of the entire population under study. 
For instance, cell phone data records are typically obtained from a single operator. Similarly, data 
from taxis, or from other vehicle fleets are typically obtained from a single company, which usually 
represents only a small fraction of the actual number of vehicles circulating in a city mm- In some 
scenarios, fully grasping a certain mobility-related phenomenon may require modelling the entire 
population of interest. For instance, it was shown that the fraction of taxi trips that can be shared 
in the city of New York is an increasing (albeit not simple) function of the number of daily taxi 
trips [24! . Hence, if a certain data set covers only a fraction of the daily taxi trips performed in a 
city, the taxi sharing potential cannot be fully unveiled. 

The above discussion motivates the need of extrapolating urban mobility data starting from a 
subset of the population of interest. Although a number of urban mobility studies have applied such 
methods [251126] , a definition and assessment of a statistically rigorous extrapolation methodology is 
so far lacking. Even the sub-problem of assessing the quality of urban movement models is to date 
open, since the skewness of the underlying statistical distributions m makes a set of consistent, 
quantitative indicators hard to develop. In this paper, we fill these gaps by introducing a rigorous 
methodology to tackle the problem of obtaining an accurate picture of a mobility process when 
only a limited observation of such a process is available, both in time and volume. We first propose 
a simple rescaling rule which allows to quantify the strong temporal regularity of urban mobility 
patterns, even at very fine scales such as trips between particular intersections. Exploiting this 
regularity, we use a maximum entropy approach combining empirical data to model the occurrence 
of the core of frequent trips with an exponential gravity model |27H29j accounting for the variation 
observed in the least-frequent trips. We apply our method to accurately reconstruct the data 
set of all taxi trips performed in the city of New York in the year 2011 using small fractions 
sub-sampled from only a month of recorded data. By analysing the temporal patterns and the 
topological properties of the yearly mobility of taxis represented as a multi-edge network, we can 
finally assess the statistical accuracy of the proposed supersampling methodology using a number 
of both information-theoretical and network-based performance metrics. 

The remainder of the paper is structured as follows: We first present the study of both the 
temporal and topological patterns observed in the data, which then allows us to construct a maxi¬ 
mum entropy method that exploits these features to solve the supersampling problem. Finally, we 
systematically test our reconstruction model on a very large data set and conclude by discussing 
some insights about the structure of urban mobility that the present study draws. 


Results 

Typically, mobility data is formalized by so-called Origin-Destination (OD) matrices, which are 
particular examples of weighted, or multi-edge networks [3111 ■ OD matrices represent the number 
of observed trips tij between the L = N 2 pairs of N locations or nodes i,j over a given observation 
period r. A location can be defined based on a spatial partitioning of the urban area, on points of 
interest [ 35 , or on road intersections [ 53 ] - as it is the case in the NY taxi data set at hand (see 
methods). Given this network representation, one can compute the total incoming s* n = ]Tb tij and 
outgoing strength s° ut = JT Uj of a node i. Throughout this paper, we define active nodes as the 
subset of nodes which are either origin or destination of at least one trip in the set of all recorded 
trips T = Eii tij and similarly active edges as the pairs of locations with at least one trip (tij > 1) 
recorded between them. The notation x shall refer to the observed value of the random variable 
x as derived from the data set, (x) to its expected value over independent realizations of a given 
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Figure 1. (a) Circadian clocks in the city: Time-series of the number of observed trips T, active 
edges E and active nodes in the city of NY per day over the year 2011. The influence of both seasonal 
fluctuations, major events and stable weekly patterns are clearly observed, (b) Aggregated fraction 
of active nodes and total trips as days of data are accumulated normalized by the total number of 
recorded nodes and trips at the end of the observation period, the evolution of the accumulated 
graph density in time E/L = E/N 2 (observed binary edges over total possible pairs of intersections) 
is also reported. Nodes are almost fully sampled within the first days analyzed, while edges are 
sampled sub-linearly in time, (c) Quantile - Quantile plot comparing the number of trips per day 
observed in the data set without outliers within two standard derivations of the mean (> 95% of the 
data) to the theoretical quantiles of a normal distribution and linear fit (dashed line) showing their 
proportionality (similar results not shown obtained for number of edges and nodes respectively). 


model, while x denotes the matrix or network average of the variable x across the full empirical OD 
matrix (for example, average graph-degree k). Finally, the symbol (x) T is used to express averages 
over time of variable x using bins of temporal length r. 

Stability of temporal urban-mobility patterns 

While the built structure of cities evolves slowly in time, many dynamic, behavioral processes that 
take place within a city unfold relatively fast, and in principle could be strongly variable across time. 
However, human activity in cities exhibits highly regular patterns when observed over well defined 
periods of time, such as circadian or weekly rhythms. Intra-urban mobility is a good example for 
such activities: With longer time spans or larger samples of gathering movement data in cities, 
the picture of the underlying mobility network will clear up continually, but stable patterns should 
already emerge with relatively few data points as we can see in Fig. [TJY. To systematically test 
this hypothesis, we make use of a fleet of taxis acting as probes, sampling from the total traffic of 
all vehicles in a city. The total number of recorded trips, or sampling size T(t), depends on the 
total observation time r and the number of probes, i.e., the size of the sub-population that is being 
monitored. 

The evolution of the sampling size of trips as a function of the observation period, Fig. [T|3, can 
be extremely well approximated by a linear relation ( R 2 > 0.999), indicating that the total number 
of trips generated daily in the city can be described as a random variable strongly concentrated 
around its mean value (T) r=lday ~ 403, 000 ± 61,000 (confidence bounds reported as standard 
deviation). 

On a yearly scale, the distribution of trips per day T t - i^ay is not statistically compatible to a 
Gaussian distribution mainly due to seasonal effects, see Fig. [TJA. The effects of summer and winter 
holidays are apparent, and of Hurricane Irene that hit New York city towards the end of August, 
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but if we disregard such outliers, corresponding to around 5% of the data that lies further than two 
standard deviations away from the mean, the quantile-quantile plot shows acceptable agreement 
with a normal distribution, see Fig.[Tp. 

To observe whether this strong regularity is also present in the finer structure of mobility, we 
must focus on each of the N nodes and L intersection pairs. Yet, we must find a suitable scaling 
to the data: The accumulated observed strength (both incoming and outgoing) Si(r) of each node 
and the weight of each intersection pair will increase as more and more data is gathered, but 

if we normalize their (in or out) strength and weight by the total number of observed trips in the 
period r, a strong regularity is recovered as we show in the following. The quantities p° ut ’ ln and 
pij , which quantify the relative importance of a given node and intersection pair compared to the 
overall network, 
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are extremely stable as shown in Table [l] We have split our data set into n T equal time intervals (on 
daily, weekly and four-week bases) and computed the relative dispersion of the values accumulated 
over the entire data set p(r = T max = lyear) around the measured values {p) T , 


p{r max ) - ( p{r)) T 

( P(i~)) T 


( 2 ) 


where T max is the time at the end of the full observation period and the averages are performed over 
all the time slices of length r. The graph-average of e is very close to zero and highly concentrated 
around this value for all the time windows considered (with a standard deviation of 13% in the 
worst case, decaying as sampling time is increased). 

Fig. [5] shows the correlation between the relative error and the relative importance of nodes and 
links. The fact that = 12iPi Ut = IZiPT = 1) coupled with second order seasonality effects 

induces an uneven distribution of errors: An overestimation of some values in the collection {{pij)} 
will forcefully induce an underestimation in some other values of the collection. Despite this issue, 
we can clearly see that the vast majority of the mass of relative errors is concentrated around zero 
(see points in background for Fig. [2]). 


Time windows n T 

PT : Sin (± std) 

Outliers 

P° s ut: tout (± std) 

Outliers 

Pij- Winter (i Std) 

Outliers 

347 (1 day period) 

-0.009 ±0.056 

0.039 

-0.037 ±0.095 

0.083 

0.018 ±0.127 

0.025 

51 (1 week period) 

-0.011 ±0.054 

0.041 

-0.040 ±0.095 

0.088 

0.014 ±0.100 

0.026 

13 (4 weeks period) 

-0.011 ±0.054 

0.041 

-0.040 ±0.095 

0.087 

0.012 ±0.061 

0.025 


Table 1. Variability of node and node-pair statistics (incoming and outgoing p° ut relative 
strength, and relative number of trips between intersections pij) using different temporal granularity 
of the data set averaged over the full network (nodes and pairs of nodes respectively) compared to 
final yearly values. Time units with a total number of trips at least two standard deviations apart 
from the adjusted yearly mean have not been considered in the average to account for seasonal 
variations (< 5% of the data in the worst case). For the pairs of intersections ij, only pairs with 
at least one non-zero appearance on the time slicing have been considered for the average. The 
fraction of data with absolute relative error larger than two standard deviations is also reported as 
Outliers. 


To some extent, we would expect the node strength to be stable over time, since the number 
of trips received and generated at each location depends on parameters such as population density, 
number of points of interest present in a given location, etc. [32], whose evolution is given by much 
slower dynamics than the mobility process studied herein. But additionally, time stability is also 
observed at the trip level between intersections, yet in this case the analysis displays a higher 
variability around the mean - see Fig. [5]4 and Table |T] This higher variability can be explained by 
the different sampling processes: While the percentage of active nodes becomes extremely stable 
already when just a very small number of days is considered (Fig. [1)3), this is not the case for 
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Figure 2. The effect of sampling on node and intersection pair temporal stability. Correlation 
between measured values of intersection pair (py) (a) and node statistics (p l s n ) (b), (p° ut ) (c) for 
different time slices and relative dispersion around the mean (equation ([2jl) for the yearly aggregated 
data. Error bars represent standard deviations on the log-binned data. Raw data for the daily case 
is shown in the background. For visual clarity, panel a) only shows a random subsample of 1/1000 
of the original points. 





the total number of active edges, because sampling of edge-specific attributes requires to grow as 
L ~ N 2 to achieve a comparable level of accuracy. 

The anatomy of urban flows 

The results on temporal stability indicate that any model aiming to reproduce human mobility 
at urban scales should consistently exhibit regularities as reported above. Having seen that the 
mobility patterns are stable across time, an understanding of the main topological aspects of the 
aggregated static picture is further needed in order to be able to select the main features our 
methodology should aim to reproduce. The most relevant topological aspect of the mobility network 
is the highly skewed concentration of taxi pick-ups and drop-offs across the city, which gives rise 
to heavy-tailed node-strength distributions, Fig. [3J\ (other general metrics are reported in the 
methods section). Therefore, to test whether this relevant property alone already captures the 
essential features of the mobility network, we must consider a null model which randomizes the 
considered network keeping constant the strength of each node -the multi-edge configuration model 
(see methods section or [30] for more details). When comparing the null model with the data, we 
observe significant deviations showing the importance of certain places or nodes in the network. 
In other words, even the strong heterogeneity of the distribution of strengths cannot account for 
the skewness of the weight distribution: Additional factors add many more trips between some 
connections than there should be under random conditions. The empirical link weight distribution, 
blue line in Fig. [3)3, is more skewed than under a random allocation of trips (Configuration model), 
dashed line, and the connections at the node level show a clear assortative correlation instead of a 
flat profile, see Fig. [3p. This occurs despite the fact that the average number of trips between the 
most busy locations can be characterized by the relation t oc s° ut s™ with a reasonable accuracy, see 
Fig. [3jC, coinciding with the configuration model. These insights indicate that the distribution of 
node strengths across the city has a strong influence on the topology of the network, and needs to 
be taken into account when modelled, but needs extra ingredients to fully account for the observed 
pattern of connections. 

A flexible model to reproduce human mobility 

A general maximum entropy based theory for model generation (see for extended discus¬ 

sion and references) allows us to efficiently exploit both the observed temporal stability features 


5 









































Figure 3. Empirical network features of the taxi multi-edge mobility network, (a) Distribution of 
incoming and outgoing strengths s. (b) Existing edge weight complementary cumulative distribu¬ 
tion function compared with a configuration model [30] and the supersampled model with / = 0.1. 
(c) Weighted average neighbor out-strength s™ n as function of node out-strength (similar results for 
incoming strengths not shown). (d) Graph-average existing weight as a function of product of out¬ 
going and incoming strengths of origin and destination for a single instance of the network. Points 
represent mean values and error bars represent standard deviations computed using log-binning 
and distributions also shown using log-binning and ensemble results averaged over 100 repetitions 
of the two models. 


and the heterogeneous topological properties of the network to solve the supersampling problem 
at hand. It starts with the assumption that each intersection pair in the network is allocated a 
constant fraction pij of the total sampling (equation (JTJ|). Under this condition, and assuming that 
the mobility process is driven by some general constraints, such as population density or budget, it 
can be proved that for any desired level of sampling Td the statistics of trips for each pair of nodes 
can be well described by a set of L independent Poisson processes with mean (t t j ) = Td (pij), with 

<i ’« )s ™(n(r) = T 1 ^(r) < 3) 

' l 3 

Following the theory, it would seem clear that from knowing the real values of the collection 
{ ['Pij )} , supersampling a mobility data set would be a trivial operation of generating L independent 
Poisson processes using the provided proportionality rule. Therefore, the problem now reduces to 
inferring the collection of values {p^} from an available data set. We shall assume that only one 
snapshot of the aggregated mobility network is available to this end (thus assuming no temporal 
information is available on the trip data) as is usually the case in mobility studies. The maximum 
likelihood estimation of such values corresponds to 

(Pi3)ML = ^rr=&j(T)- ( 4 ) 

T(t) 

There is, however, a practical issue in this formula related with the normalization condition for the 
random variables {pij} and the presence of empty intersection pairs in the available observed data. 
For such intersections, using the formulas above, we have that p^ = tij = 0, whereas their real 
value is unknown but fulfils ( p^j ) £ [0 ,p m in ~ T~ x ], Since by definition both collections {pij} and 
{pij} need to be normalized, and denoting the set of active edges as £ = {ij\Uj > 0}, we have, 

yi (pij) = i = yi ( Pij ) + (p^) y^p^ = yi p%j +o= 1 , (5) 

ij ij\ije£ ij\ij<££ './ i.i i.H £ 

from which we see that Yhij\e£ (P*j) < J2ij\e£Pij = 1- 

Hence, in general we cannot consider the empirically observed probabilities p^ as a good proxy 
for the real values of p^ unless the number of empty intersection pairs is very reduced. Given that 
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the percentage of active edges (pairs of nodes for which tij > 0 ) is a very slowly increasing function 
of the sampling, see Fig. [ljB, inferring directly the set of probabilities {pij} empirically would take 
an enormous data set - note that even with over a year of data only roughly 40% of edges are 
covered. 

For the reasons given above, a simple proportionality rule using equation Q is not a good 
supersampling strategy, specially for skewed and sparse data sets. 


Supersampling urban trips 

Based on the previous discussion, we now present the methodology for supersampling an urban 
mobility data set that consists in inferring the collection of {{p^)} values from a set of aggregated 
empirical trips { t ^}. We should do so bearing in mind that despite the maximum likelihood formula 
in equation dj) cannot be directly used for the empty intersection pairs in the data, it does perform 
well for non-empty intersections (see Fig. [2j. 

The maximum entropy based framework naturally allows such a procedure, since it can combine 
any constraint driven model with the rich information encoded in the trip sample. We propose a 
method to predict trips based on the theory mentioned earlier: Taking the L intersection pairs 
(being them active edges in the data set or not), we split them into two parts, the subgroup of 
trusted trips defined as Q = {ij\Uj > t m in} and its complementary part Q c . The value t m i n is a 
threshold modelling a minimal statistical accuracy that depends on the amount of data available, 
and which may be set to 1 in practical applications. We keep the proportionality rule pij oc tij for 
the trusted trips, while for the remaining trips we apply a doubly constrained exponential gravity 
model -other maximum entropy models [ 251 could also perform well as long as they preserve the 
outgoing and incoming strength. In other words, we generate a collection of {( Pij )} values, 


(Pij) = 


= J -f = Pij ij G Q 

Xiyje ~ lc « ij € Q c 


( 6 ) 


The values 7 and { 27 , yj} are the 2 N + 1 Lagrange multipliers satisfying the following equations 


s°i Ut ~T Y Pij 

= fxi Y y> e ~ 7Cij 

A ij&Q 

i\ij&Q c 

*T- f E ftj 

= f Vj Y x ^° ij 

j\ijeQ 

j\ij&Q c 

G — T 1 ' Cij pij 

= T Y CijXiijje~ 

ij\ijeQ 

ij\ij£Q c 


ICij 


( 7 ) 


where C = JW UjCij is the total euclidean distance of the observed trips (cij stands for the distance 
between intersections i and j). Note that, by construction, the values are properly normalized, i.e., 

E»j (Pij) = 11, + HijtQC x iyj e ~ 1Cij = !• 

The model presented earlier needs to deal with the issue of inactive nodes that do not appear 
in the original data due to poor sampling, i.e., nodes for which s = 0 either incoming or outgoing 
for some observation period r. This issue has a minor impact in our case due to the previously 
observed rapid coverage of the number of active nodes (the number of inactive nodes is negligible 
after accumulating very few days of data, see Fig. [TJ3) . In any case, it can be solved easily: given 
that the geographic positions of the nodes are available, we could always artificially assign a certain 
relative strength to the nodes not present in the data using complementary call detail records [34| . 
census data or points of interests (POI) data, or assign them some values according to a chosen 
distribution depending on the data at hand. For simplicity, in our case we have chosen to keep only 
the nodes present in the original data. 
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Assessing the quality of the supersampling methodology 

To test the supersampling methodology, we have proceeded to select a timespan of our data set 
corresponding to an observation period of r = 1 month (February 2011) from which we further ran¬ 
domly sub-sample different fractions / used as training sets to compute { (p l:] )} applying equations 
© and ©. We then reconstruct the OD using the proportionality rule {(Uj) = Td (pij)} for both 
the complete and reduced data set, Td = T(t' = 1 year) and Td = T(t = 1 month). Finally, we 
compare the model predictions with the set of empirically observed trips in these periods. 

In order to do so, we need to introduce metrics to quantify the resemblance between model 
predictions and actual recorded values. Commonly used indicators are the Sorensen-Dice common 
part of commuters (CPC) value [J5] or the linear fit of {tij) model vs tij [20]. However, the skewness 
of the observed trip distribution represents a challenge to these indicators: The variability of low¬ 
valued trips induces notable instabilities on both and being single numbers, they are only able to 
provide a limited picture on the precision of a given model to reproduce empirical results. 

To overcome these issues, we propose a slight modification to both the Sorensen CPC index and 
the coefficient of determination R 2 from the fit (tij) model vs tij (see methods) and we additionally 
propose the introduction of a number of network metrics to precisely assess the quality of the models 
used in human mobility at the topological, finer, scale: The unweighted degrees of the nodes (k) (s), 
the weighted neighbor strength correlation (s™ n ) (s), existing trip distribution P(t) and number of 
existing trips as a function of the origin destination strength product t(ss'). 



r = 1 month 

r 

= 1 year 


/ 

fa 

•£/ £Emp 

CPC 

jd‘2 

^cond 

fa 

CPC 

jd‘2 

cond 

1.00 

0.8855 

1.46 

0.92 

1.00 

0.07090 

0.78 

0.91 

0.75 

0.6417 

1.64 

0.83 

0.98 

0.05138 

0.76 

0.89 

0.50 

0.4014 

1.88 

0.77 

0.94 

0.03214 

0.74 

0.86 

0.25 

0.1711 

2.26 

0.68 

0.83 

0.01370 

0.69 

0.78 

0.10 

0.0492 

2.64 

0.60 

0.65 

0.00394 

0.65 

0.63 

0.01 

0.0012 

- 

0.58 

0.21 

0.00010 

0.66 

0.26 

0.005 

0.0003 

- 

0.59 

0.12 

0.00002 

0.66 

0.18 

Configuration 

- 

- 

0.57 

-0.87 

- 

0.64 

-0.22 

Empirical 

1 

1 

1 

1 

1 

1 

1 


Table 2. Parameters for the validation of the methodology. See details on each indicator in the 
main text and in the methods section. The number of trusted trips fed to the model relative to the 
entire number of generated trips /q = Ylij\ijeQ /^d( r ) * s reported in column 3. The Supersampled 
models with different fractions / are only generated using subsamples of the training set (1 month 
observation period). Empirical stands for the model generated using the empirical probabilities pij 
(equation JTJ) of the full data set and Configuration stands for the multi-edge configuration model 
applied to the full data set. 


The results for the supersampling method are summarized in Tableland a specific example for 
/ = 0.1 (reconstruction using only 10% of the original data of the monthly data set compared to 
yearly data) is shown in Fig. [I] for the different indicators proposed. For comparison, results using 
both a configuration model and the empirical values {pij} using the information encoded in the 
complete dataset are also shown. 

We observe an accurate reconstruction of the mobility network for a wide range of values of /, 
which shows the validity of our proposed supersampling methodology. At the global scale, even at 
extreme levels of subsampling, our model is successful at reconstructing the original dataset. Also 
at the topological scale, despite the heterogeneities in the underlying distributions, the methodology 
generates very accurate predictions. The predictions for the least frequently visited nodes display 
higher relative errors due to the presence of inactive nodes in the training dataset (1.6% of total 
nodes for / = 0.1). 

Upon close inspection, our inferred values {(pij)} slightly over-estimate low-valued weights and 














Figure 4. Main network differences between real data accumulated over a year and Supersampled 
model from real one month data with / = 0.1 subsampling, (a)-(d) Relative error (((*} — x)/x 
with x being a magnitude measured from the aggregated yearly network) between reconstructed 
network using supersampling and original data for outgoing degrees (a), strengths (b) and average 
neighbor strength (d) (similar results for incoming direction not displayed). The complementary 
cumulative distribution function of both edge lengths and trips lengths (c) is also shown, (e) 
Comparison between empirical {hj} values and model prediction over a single run. Configuration 
model expectation from a single run using the full year data set is also shown for comparison. All 
results averaged over 100 repetitions of the model, error bars represent standard deviations on the 
log-binned data and raw data is shown in the background. For visual clarity, panel e) only shows 
a random subsample of 1% of the raw data in background. 



underestimate large-valued weights and strengths as was expected from the analysis in temporal 
stability (see Fig.[2|), yet the errors are greatly mitigated as we can see in Fig. 03. See Figs. 03 (green 
line) and HP (green dots) where we can observe a gap around t ~ 100 and ) ~ 100, respectively, 
which corresponds to the separation between the trusted empirical data (separated points in the 
background belonging to the group of trusted trips Q) and the reconstructed trips (clustered cloud 
of points). The minor seasonal fluctuations detected first in our temporal analysis together with 
these over- and under-estimations explain the minor limitations of the model to reproduce perfectly 
the entire yearly data set. 

The second order effects induced by the seasonality of recorded data can also be seen in the 
performance of our methodology under extreme levels of subsampling (using around 1% of the 
sample monthly data to feed the model or less). In these circumstances, the model is still able to 
produce a good prediction of the empirical data, yet it reproduces better the accumulated yearly 
mobility rather than the monthly one since the inherent seasonal variations of traffic between certain 
intersection pairs are smoothed by the aggregation procedure. 

Furthermore, in the event that enough historical data were available, we could achieve even 
better results by computing the collection {{Pij) T } with an appropriate t period (depending on the 
granularity of the data) and approximating (p-ij ) ~ ( Pij) for the group of trusted trips (equation 
©)■ Such a procedure, which may be extended to overcome the minor limitations imposed by the 
seasonality of the data and other improvements related with the presence of non-active nodes could 
be derived to perfect the method. 


Discussion 

The stationarity of the temporal statistics of trips between different locations, together with a suit¬ 
able scaling for the data observed over different timespans, has allowed us to develop a general 
model that is highly effective at reconstructing a general mobility scenario from very limited ag- 
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gregated data, without the need of having fine grain temporal information, and provides insights 
about the structure of real taxi trips. The success of our reconstruction method, even using very 
small amounts of data, points out the composite structure of the network of urban mobility: Taxi 
displacements are characterized by a small core of very frequent trips coupled with trips generated 
at random but conditioned by the structural constraints of the city such as population distribution 
and mobility costs. 

Ultimately, the results presented in this paper could be used to answer questions that are of 
fundamental importance in the field of human mobility modelling, such as: i) Can an accurate 
picture of urban mobility patterns be obtained from an incomplete sample of the population?, and 
ii) are existing metrics sufficient to assess the quality of model predictions? 

We pointed out the importance of data sampling and of the correct assessment of mobility 
models, and introduced network-based tools to evaluate such models. The implications of these 
findings are two-fold: On the one hand, the stationarity of the temporal patterns could be exploited 
to save space and effort in recording mobility data. On the other hand, our method opens the 
possibility of efficiently scale up data from reduced fleet of vehicles in cases where a full knowledge 
of the system is needed. 

Our study provides a first step in showing that incomplete samples can indeed be scaled up 
adequately with the appropriate models, and that network metrics are required to comprehensively 
assess mobility model predictions. 


Materials and Methods 

Data set 

OD matrices are typically inferred from either census/survey data or alternative means such as 
social media data or Call Detail Records (CDR) (35]. They must be constructed in terms of trips, 
i.e. well defined trajectories between a starting and ending point. For this reason, we use the taxi 
data for its completeness, since all trips are recorded, and we consider it a good proxy for general 
urban mobility, with the caveat that commuting patterns and areas with little taxi demand are 
covered less. For the present paper we have used the full set of taxi trips (with customers) with 
starting and ending points within Manhattan obtained from the New York Taxi and Limousine 
Commission for the year 2011 via Freedom of Information Law request (24|. We have aggregated 
the data at the node level taking into account the grid of roads up to secondary level and adding 
trips to the nearest node (intersection) in a radius of 200m. We have decided to keep the self-loops 
present in the data for simplicity (albeit their fraction is completely negligible). In the analysis, 
all trips including both week-ends and week-days are considered, since the pattern for weekly trips 
shows a continuous increase in the number of trips peaking on Friday and followed by a sudden 
drop on Sundays. 

For the subsampling of the reduced data set used as basis to reconstruct the networks, we have 
used random subsampling. The parameters of the obtained mobility networks for two different 
observation periods r are reported in Table [3] 

The analyzed data shows that most of the taxis share similar performance. This, coupled with 
the fact [25j that individual taxi mobility traces are in large part statistically indistinguishable from 
the overall population, justifies that their individual traces (corresponding to sets of trips performed 
by different customers which can be considered as independent events) can be safely aggregated for 
the analysis. 
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2434 ± 1801m 
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4091 

7251605 
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35929 

51325 

54877 

1773 

799 

1206 

8.8 

0.43 

60.84 

20.27 

2458 ± 1831m 


Table 3. Empirical network parameters. Symbol x stands for graph averaged magnitudes. See 


main text and methods for a definition of each magnitude. 
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Null model: The Multi-Edge Configuration model 

Cities usually display a high level of variability across its different locations in terms of activity, 
i.e., city center concentrate busy locations while outskirts usually display less traffic/retails areas 
and others. At the level of networks, this translates in important topological heterogeneities which 
need to be accounted for using a suitable null model. The Multi-Edge configuration model [3D] is a 
maximum entropy model that is used to generate maximally random network instances of graphs 
with a prescribed strength sequence {s m ,s out } (number of trips entering and leaving each node). 
The expected number of trips between two nodes i and j with respective strengths s° ut and s™ 

reads = s° ut sJ l /T and the assortativity profile is flat (nodes are uncorrelated at the level 

of strengths). Throughout this paper, we use this null model as benchmark preserving the strength 
sequence obtained from aggregating the complete yearly observation period. 


Indicators for the quality of the reconstruction 

Distance based measures 


• Sorensen-Dice common part of commuters index: This indicator was proposed in [35] 
and based on the original formulation is defined as 


CPC. 


sample 


2E, jeg niin(4-,f-^) 

E X . | V'' 4-model 

ij L ij ' L ij 


( 8 ) 


We propose an alternative version formulated in terms of averages which reads, 

cpc = 2E^ £ min(4-,fa)) ^ . 

Eij Xj + Ylijes ftij) 

The different versions of this indicator have values in the range [0,1], where CPC = 1 indicates 
total coincidence between data and model and CPC = 0 total disagreement. However, for 
sparse data sets with a skewed distribution of {Uj} values, equation © may return values 
excessively lower than 1, even for models very close to reality. To exemplify this fact, Table [D] 
shows a comparison of the performance of the two indicators for the models presented in 
Table [2] For the Empirical model, we can see that the second version of the indicator recovers 
values close to 1 as would be expected. Furthermore, both indicators converge to very similar 
values as sampling is increased (the yearly data set contains roughly 12 times more trips than 
the monthly one). 



r = 1 month 

T = 1 

year 

/ 

<■ CPC sample ) 

CPC 

(CPC sampl 

e) CPC 

1.00 

0.786 

0.92 

0.777 

0.78 

0.75 

0.757 

0.83 

0.758 

0.76 

0.50 

0.713 

0.77 

0.731 

0.74 

0.25 

0.639 

0.68 

0.686 

0.69 

0.10 

0.562 

0.60 

0.647 

0.65 

0.01 

0.539 

0.58 

0.655 

0.66 

0.005 

0.543 

0.59 

0.655 

0.66 

Configuration 

0.525 

0.57 

0.641 

0.64 

Empirical 

0.829 

1 

0.936 

1 


Table 4. Values for the different versions of the common part of commuters index for the re¬ 
constructed models. ( CPC sam pie ) and averaged trip values computed over 1000 repetition of each 
model, standard deviations lower than 10 -3 for all cases. Note how differences between indicators 
disappear with increased sampling. 
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• Linear correlation (tij) model vs f y : This method is widely used [2(3]. We report the 
coefficient of determination R^. ond in all tables, based on the comparison between real data 
and conditional values of the model on the existing edges (since we are using a biased statistic 
based only on the observed trips in the original data, not the entire set of intersection pairs). 
With Poisson distributed variables with mean (t l3 ), such conditioned average reads, 


(ft) 


0 <ty) = 0 


( 10 ) 


which converges to the average value for highly used trips. Hence, the coefficient of determi¬ 
nation Florid assuming an identity relation (fy) ~ t data is explicitly, 



Network measures To better grasp the quality of the models, we propose to compare also some 
of its multi-edge network related quantities: 

• Degree: The unweighted degree of the nodes is the sum of their incoming/outgoing active 

edges (edges for which iy > 0), k x = 0(ty) being E = 0(ty) the total number of 

active edges and x referring to the outgoing i or incoming direction j. 

• Average weighted neighbor strength: This metric is widely used in the literature. It 

indicates the level of correlations at the node level and is defined as s™ n (s x ) = ; where 

y is the complementary of a; (if x = i then y = j and vice versa). 

• Distribution of weights on existing edges: This commonly used measure is computed 
as P(t) = fitdijE and indicates the collection of weight values present in the network, 
where 6 Xty corresponds to the Kroenecker delta. 

• Graph average existing weight of trips as a function of product of incoming 
and outgoing degree: To quantify the deviation from a completely randomized configu¬ 
ration model, we compute also this metric, which is the average weight of existing trips as 
a function of the product of out(in) strengths of their origin (destination) nodes: t(ss') = 
Syb= s ' i=s ^ij! n ss 'where n ss ' is the cardinality of the sum. For the configuration case, this 
magnitude is equivalent to (t + ) oc 1 _ e _ s as //T ■ 

Information values We also assess the quality of our models using their Log-Likelihood values 
assuming a set of independent Poisson random variables with known means (tij) for each intersection 
pair, 

In j e 

ij \ 

Incompatible Loglikelihood values are not reported in tables (such as cases where (tij) = 0 < ty or 
I tij — (tij) I »0). 

Simulations 

All the simulations and solving of saddle point equations as well as the analysis of the multi-edge 
networks have been performed using the freely available, open source package Origin-Destination 
Multi-Edge analysis (ODME) [57] . 


£ = InP({tlj}\ {(t^)}) = Y, 


fV • ( 12 ) 

tij ! J 
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